Skip to content

Telemetry: normalize host key for per-host client + breaker registries#364

Merged
samikshya-db merged 1 commit into
mainfrom
telemetry-normalize-host-key
May 25, 2026
Merged

Telemetry: normalize host key for per-host client + breaker registries#364
samikshya-db merged 1 commit into
mainfrom
telemetry-normalize-host-key

Conversation

@samikshya-db
Copy link
Copy Markdown
Collaborator

Summary

  • clientManager and circuitBreakerManager keyed their maps by the raw host string, so DSN variants like example.com, example.com/, and https://example.com created separate telemetry clients and separate circuit breakers for the same logical host. That fragments trip state and defeats the per-host consolidation the managers are designed to do.
  • Add normalizeHostKey (lowercase, trim whitespace, strip http(s)://, trim trailing slashes) and use it for all registry lookups in clientManager.getOrCreateClient, clientManager.releaseClient, and circuitBreakerManager.getCircuitBreaker.
  • The host string flowing into URL construction is unchanged — first-caller-wins, matching the existing user-agent semantics documented on getOrCreateClient.

Addresses Gopal's review comment on #354: #354 (comment)... (host normalization on telemetry/exporter.go:62).

Test plan

  • TestNormalizeHostKey — covers scheme stripping, lowercase, trim trailing slash, whitespace, empty string
  • TestClientManager_HostVariantsShareClient — four DSN variants of the same host share a single telemetry client; release via a different variant still cleans up
  • TestCircuitBreakerManager_HostVariantsShareBreaker — same coverage for the breaker registry
  • Full go test ./telemetry/... passes

This pull request and its description were written by Isaac.

Both clientManager and circuitBreakerManager keyed their maps by the raw
host string, so DSN variants like "example.com", "example.com/", and
"https://example.com" produced separate telemetry clients and separate
circuit breakers for the same logical host. That fragments trip state and
defeats the per-host consolidation the managers are designed to do.

Introduce normalizeHostKey (lowercase, trim whitespace, strip http/https
scheme, trim trailing slashes) and use it for all registry lookups in
clientManager.getOrCreateClient, clientManager.releaseClient, and
circuitBreakerManager.getCircuitBreaker. The host string flowing into URL
construction is unchanged — first-caller-wins, matching existing
user-agent semantics documented on getOrCreateClient.

Co-authored-by: Isaac
Copy link
Copy Markdown

@jprakash-db jprakash-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢

@samikshya-db samikshya-db merged commit 7b961f2 into main May 25, 2026
2 of 3 checks passed
@samikshya-db samikshya-db deleted the telemetry-normalize-host-key branch May 25, 2026 07:17
@samikshya-db samikshya-db mentioned this pull request May 25, 2026
3 tasks
samikshya-db added a commit that referenced this pull request May 25, 2026
## Summary
Bump `DriverVersion` to `1.12.0` and add the `v1.12.0` section to
`CHANGELOG.md`.

### Notable changes since v1.11.1
- Retry transient S3 errors in CloudFetch downloads and staging
PUT/GET/REMOVE operations (#355, #361)
- Telemetry: normalize host key for per-host client + breaker
registries; stop retrying into 429s, honour Retry-After, fix userAgent
(#354, #364)
- Bump dependencies to clear Go-1.20-compatible CVEs: golang-jwt, x/net,
protobuf, go-jose v3.0.5 (CVE-2026-34986) (#360, #363)

## Test plan
- [x] `go test ./... -count=1 -short` passes locally
- [x] Multi-DBR tests in `universe/peco/correctness/go` passed (all 5
suites green against DBR 18.x-photon-scala2.13)
- [ ] CI green

This pull request was AI-assisted by Isaac.

Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
vikrantpuppala added a commit that referenced this pull request Jun 1, 2026
The sentinel Watch loop had a race: once StatusFn reported Done, it
spawned the OnDoneFn processor and looped back into its select with two
potentially-ready cases at once -- the processor result (resCh) and
ctx.Done(). Go's select picks randomly among ready cases, so a context
cancellation arriving at the same instant as completion could win and
trigger OnCancelFn, cancelling an operation that had already finished.

TestConn_ExecContext/"ExecContext uses new context to close operation"
cancels the context inside GetOperationStatus while returning FINISHED,
then asserted cancelOperationCount == 1 -- i.e. it asserted on the losing
side of that random select. The Go scheduler version only shifts the
probability; the race is in the design, not the language version. The
racy pattern dates to #41 (Nov 2022) and the test to #86 (Dec 2022); it
was not introduced by #364.

Fix: once Done is observed, the operation has logically completed and its
outcome is authoritative. Drain the OnDoneFn result via a new waitForDone
helper that selects only on resCh/errCh, no longer on ctx.Done()/timeout.
A finished operation is now never cancelled, deterministically.

Update the connection test to assert the correct deterministic behaviour
(success, cancelOperationCount == 0, CloseOperation still runs on a fresh
context) and add a sentinel regression test that pre-cancels the context
in the same StatusFn call that reports Done.

Verified: TestConn_ExecContext -count=300 green; sentinel suite green
under -race -count=5.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants